Tip: Welcome to the Investigate a Dataset project! You will find tips in quoted sections like this to help organize your approach to your investigation. Once you complete this project, remove these Tip sections from your report before submission. First things first, you might want to double-click this Markdown cell and change the title so that it reflects your dataset and investigation.
Tip: In this section of the report, provide a brief introduction to the dataset you've selected/downloaded for analysis. Read through the description available on the homepage-links present here. List all column names in each table, and their significance. In case of multiple tables, describe the relationship between tables.
Tip: Clearly state one or more questions that you plan on exploring over the course of the report. You will address these questions in the data analysis and conclusion sections. Try to build your report around the analysis of at least one dependent variable and three independent variables. If you're not sure what questions to ask, then make sure you familiarize yourself with the dataset, its variables and the dataset context for ideas of what to explore.
Tip: Once you start coding, use NumPy arrays, Pandas Series, and DataFrames where appropriate rather than Python lists and dictionaries. Also, use good coding practices, such as, define and use functions to avoid repetitive code. Use appropriate comments within the code cells, explanation in the mark-down cells, and meaningful variable names.
# Use this cell to set up import statements for all of the packages that you
# plan to use.
# Remember to include a 'magic word' so that your visualizations are plotted
# inline with the notebook. See this page for more:
# http://ipython.readthedocs.io/en/stable/interactive/magics.html
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
% matplotlib inline
# Upgrade pandas to use dataframe.explode() function.
#!pip install --upgrade pandas==1.1.5
Tip: In this section of the report, you will load in the data, check for cleanliness, and then trim and clean your dataset for analysis. Make sure that you document your data cleaning steps in mark-down cells precisely and justify your cleaning decisions.
Tip: You should not perform too many operations in each cell. Create cells freely to explore your data. One option that you can take with this project is to do a lot of explorations in an initial notebook. These don't have to be organized, but make sure you use enough comments to understand the purpose of each code cell. Then, after you're done with your analysis, create a duplicate notebook where you will trim the excess and organize your steps so that you have a flowing, cohesive report.
For this project I will use data from Gapminder (https://www.gapminder.org/data/). For my analysis, I have specifically selected the following indicators:
Gapminder incorporates data from the 1950s, with projections for up 2099 for about 200 countries (some countries are excluded from certain indicators). This is a huge amount of data to analyse.
I therefore chose to limit the range of data for my analysis to historic data for the past five full decades (i.e. 1971 to 2020), as well as consider only countries in my current geography, the Southern African Development Community (SADC). I believe this focus on a specific data set of interest will allow me to draw more impactful insights than if I had gone with a general view of the entire data set.
SADC is a grouping of 16 countries in the southern most part of Africa.
#Defining the columns that I want to use across all the datasets
select_cols = ['country', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
'1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
'1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
'1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
'2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
'2014', '2015', '2016', '2017', '2018', '2019', '2020']
#Defining the countries that I want to use across all the datasets
sadc_countries = ['Angola', 'Botswana', 'Comoros', 'Congo, Dem. Rep.', 'Eswatini', 'Lesotho', 'Madagascar', 'Malawi', 'Mauritius', 'Mozambique', 'Namibia', 'Seychelles', 'South Africa', 'Tanzania', 'Zambia', 'Zimbabwe']
This section loads the data from the forur dataframes listed above by reading from the csv files downloaded from the Gapmider website.
The loaded data is then subset for the years and countries of interest.
Finally a quick look at the data to confirm that the loading and subseting were successful will be done by checking the number of rows and columns (shape) for each dataframe as well as showing the first few rows of each dataframe.
#Loading and delimiting the Income per person (GDP/capita, PPP$ inflation-adjusted) data
income_df = pd.read_csv('income_per_person_gdppercapita_ppp_inflation_adjusted.csv')
income_df = income_df[select_cols]
income_df = income_df[income_df.country.isin(sadc_countries)].reset_index(drop=True)
print(income_df.shape)
income_df.head()
#Loading and delimiting the Life expectancy (years) data
life_expectancy_all_df = pd.read_csv('life_expectancy_years.csv')
life_expectancy_all_df = life_expectancy_all_df[select_cols]
life_expectancy_all_df = life_expectancy_all_df[life_expectancy_all_df.country.isin(sadc_countries)].reset_index(drop=True)
print(life_expectancy_all_df.shape)
life_expectancy_all_df.head()
#Loading and delimiting the Life expectancy, male data
life_expectancy_male_df = pd.read_csv('life_expectancy_male.csv')
life_expectancy_male_df = life_expectancy_male_df[select_cols]
life_expectancy_male_df = life_expectancy_male_df[life_expectancy_male_df.country.isin(sadc_countries)].reset_index(drop=True)
print(life_expectancy_male_df.shape)
life_expectancy_male_df.head()
#Loading and delimiting the Life expectancy, female data
life_expectancy_female_df = pd.read_csv('life_expectancy_female.csv')
life_expectancy_female_df = life_expectancy_female_df[select_cols]
life_expectancy_female_df = life_expectancy_female_df[life_expectancy_female_df.country.isin(sadc_countries)].reset_index(drop=True)
print(life_expectancy_female_df.shape)
life_expectancy_female_df.head()
Tip: Now that you've trimmed and cleaned your data, you're ready to move on to exploration. Compute statistics and create visualizations with the goal of addressing the research questions that you posed in the Introduction section. You should compute the relevant statistics throughout the analysis when an inference is made about the data. Note that at least two or more kinds of plots should be created as part of the exploration, and you must compare and show trends in the varied visualizations.
Tip: - Investigate the stated question(s) from multiple angles. It is recommended that you be systematic with your approach. Look at one variable at a time, and then follow it up by looking at relationships between variables. You should explore at least three variables in relation to the primary question. This can be an exploratory relationship between three variables of interest, or looking at how two independent variables relate to a single dependent variable of interest. Lastly, you should perform both single-variable (1d) and multiple-variable (2d) explorations.
The exploratory data analysis for each dataframe is a repetitive process with the following steps:
Initial data cleaning: Because the income_df data used 'k' to denote thousands, i need to do some initial data cleaning to replace the 'k' with numeric thousands (000) so that I can explore the income_data data as numberic
#Creating a list of only the year columns
years = ['1971', '1972', '1973', '1974', '1975', '1976', '1977','1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986','1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995','1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004','2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013','2014', '2015', '2016', '2017', '2018', '2019', '2020']
#replacing 'k' with numeric 000's
replace_k = income_df[years].replace({'k': 'e+03'}, regex=True).astype(float)
#Updating the income_df dataframe with the numeric years columns
income_df.drop(years, axis=1, inplace=True)
income_df = pd.concat([income_df, replace_k], axis=1)
#Checking for null values in the income_df - there are none (0)
(income_df.isnull().sum() > 0).sum()
#Confirming the data types of the income_df columns
income_df.dtypes
#Plotting the average GDP per capita income for all SADC countries
plt.figure(figsize=(16,4))
income_df.mean(axis=0).plot();
plt.title("SADC GDP per capita income over the years 1971 - 2020");
plt.ylabel("Income in $");
plt.xlabel("Years: 1971 - 2020");
#A look at the boxplot version of the average GDP per capita income for all SADC countries
#This views shows the skew of the data as well as any outliers
income_df.plot(figsize=(24,6), kind='box');
plt.title("SADC GDP per capita income over the years 1971 - 2020 - Boxplot");
#A look at the histogram version of the average GDP per capita income for all SADC countries
#this view shows the skew of the data as well as the relative size of each grouping (i.e., number of countries in each grouping)
fig, ax = plt.subplots(10, 5, figsize=(16, 30))
income_df.hist(ax=ax);
#A look at the descriptive statistics of the dataset
income_df.describe()
#Checking the number of null values
(life_expectancy_all_df.isnull().sum() > 0).sum()
#Confirming the data types of each column in the dataframe
life_expectancy_all_df.dtypes
#A look at the descriptive statistics of the dataset
life_expectancy_all_df.describe()
#Plotting the average combined life expectancy for all SADC countries
plt.figure(figsize=(16,4))
life_expectancy_all_df.mean(axis=0).plot();
plt.title("SADC combined average life expectancy over the years 1971 - 2020");
plt.ylabel("life expectancy in years");
plt.xlabel("Years: 1971 - 2020");
#A look at the boxplot version of the average combined life expectancy for all SADC countries
#This views shows the skew of the data as well as any outliers
life_expectancy_all_df.plot(figsize=(24,6), kind='box');
#A look at the histogram version of the average combined life expectancy for all SADC countries
#this view shows the skew of the data as well as the relative size of each grouping (i.e., number of countries in each grouping)
fig, ax = plt.subplots(10, 5, figsize=(16, 30))
life_expectancy_all_df.hist(ax=ax);
#Checking the number of null values
(life_expectancy_male_df.isnull().sum() > 0).sum()
#Confirming the data types of each column in the dataframe
life_expectancy_male_df.dtypes
#A look at the descriptive statistics of the dataset
life_expectancy_male_df.describe()
#Plotting the average male life expectancy for all SADC countries
plt.figure(figsize=(16,4))
life_expectancy_male_df.mean(axis=0).plot();
plt.title("SADC average life expectancy for males over the years 1971 - 2020");
plt.ylabel("life expectancy in years");
plt.xlabel("Years: 1971 - 2020");
#A look at the boxplot version of the average male life expectancy for all SADC countries
#This views shows the skew of the data as well as any outliers
life_expectancy_male_df.plot(figsize=(24,6), kind='box');
#Checking the number of null values
(life_expectancy_female_df.isnull().sum() > 0).sum()
#Confirming the data types of each column in the dataframe
life_expectancy_female_df.dtypes
#A look at the descriptive statistics of the dataset
life_expectancy_female_df.describe()
#Plotting the average female life expectancy for all SADC countries
plt.figure(figsize=(16,4))
life_expectancy_female_df.mean(axis=0).plot();
plt.title("SADC average life expectancy for females over the years 1971 - 2020");
plt.ylabel("life expectancy in years");
#A look at the boxplot version of the average female life expectancy for all SADC countries
#This views shows the skew of the data as well as any outliers
life_expectancy_female_df.plot(figsize=(24,6), kind='box');
#Histograms with a comparison of the male and female average life expectancies for all SADC countries for each year
#PINK = Female life expectancy histograms
#Light black/Grey = Maleale life expectancy histograms
fig, ax = plt.subplots(10, 5, figsize=(16, 30))
life_expectancy_male_df.hist(ax=ax, alpha=0.3, color='black');
life_expectancy_female_df.hist(ax=ax, alpha=0.8, color='pink');
The EDA steps above revealed a few interesting things and informed the questions that I will answer in the next section. A few of the key observations are listed below:
My analyisis will attempt to answer the following questions:
Each of these questions will be answered by means of visualisations in the section that follow.
For this question, all the line graphs from the EDA section are collated in one visualisation so that comparisons can be made. The visualisation makes use of two different y-axes, one with the life expectancy in years and the other with the per capita income in dollars.
The lines each depict the average (mean) values for each of the years for all the SADC countries combined.
#Plot the average (mean) values for each of the years for all the SADC countries combined for each indicator.
fig,ax = plt.subplots(figsize=(16,5))
ax2=ax.twinx();
income_df.mean(axis=0).plot(ax=ax2, color='green', linestyle='dotted', label='income', marker = '+')
life_expectancy_all_df.mean(axis=0).plot(ax=ax, color='brown', label='all_life_expectancy', marker = 'o')
life_expectancy_male_df.mean(axis=0).plot(ax=ax, color='black', label='male_life_expectancy', marker = '.', alpha=0.3)
life_expectancy_female_df.mean(axis=0).plot(ax=ax, color='pink', label='female_life_expectancy', marker = '.', alpha=0.8)
plt.ylabel('per capita income in $', color='green')
ax.set_ylabel('life expectancy in years', color='brown')
ax.set_xlabel('Time period (years): from 1971 to 2020')
plt.title('SADC life expectancy and GDP per capita income over the years 1971 - 2020')
ax.legend(loc='center left');
ax2.legend(loc='center right');
The observations from the visual above seem to generally support the notion that as the average GDP per capita income for SADC rose during the years 1971 to 2020 it was accompanied by a rise in the life expectancies as well. This seems to be true for all the life expectancy indicators that we are tracking, i.e., combined, male and female expectancies.
One exception to this is during the years roughly in the 1991 to 2000 (and into the early 2000's) where per capita income is rising but the life expectancy indicators for that period are falling.
The other exception is around the years 2019/2020 where average per capita income dips sharply but the life expectancy indicators do not dip as sharply, the combined life expectancy figure does show signs of plateauing during this 2019/2020 but only so slightly.
On average, rising per capita income does seem to be related to the life expectancy indicators for the SADC countries. The few exceptions to this may be explained with other events that were happening in the macro-environment. This will be touched on in the conclusion section.
This section follows the same approach as question 1, but this time I am looking at the indicators country by country (as opposed to taking the average for the entire SADC region)
#Defining a function that loops through the list of SADC countries and for each country produce a visualisation that
#plots the values for each indicator for the country.
def view_country(country):
'''
Takes a country name and shows the income and life expectancy graphs for the specified country
country (string) - name of the country
'''
fig,ax = plt.subplots(figsize=(16,5))
ax2=ax.twinx();
income_df[income_df.country == country].mean(axis=0).plot(ax=ax2, color='green', linestyle='dotted', label='income', marker = '+')
life_expectancy_all_df[life_expectancy_all_df.country == country].mean(axis=0).plot(ax=ax, color='brown', label='all_life_expectancy', marker = 'o')
life_expectancy_male_df[life_expectancy_male_df.country == country].mean(axis=0).plot(ax=ax, color='black', label='male_life_expectancy', marker = '.', alpha=0.3)
life_expectancy_female_df[life_expectancy_female_df.country == country].mean(axis=0).plot(ax=ax, color='pink', label='female_life_expectancy', marker = '.', alpha=0.8)
plt.ylabel('per capita income in $', color='green')
ax.set_ylabel('life expectancy in years', color='brown')
ax.set_xlabel('Time period (years): from 1971 to 2020')
plt.title(country +' life expectancy and GDP per capita income over the years 1971 - 2020')
ax.legend(loc='upper left');
ax2.legend(loc='center right');
#Call the function to print the visualisation per country
for country in sadc_countries:
view_country(country)
Looking at the visualisations country by country brings interesting insights into how the individual countries contrinuted to the average trend for SADC in question 1. It is worth noting that although the country by country trend analysis roughly shows the same trend as the average in most cases, there are some countries that exhibit the trend more severely than others, whereas other countries seem to be less impacted. There are also some countries that seem to be trending against the average trend in some indicators or for certain periods during the period under review. A few of these observations are noted below.
From the visualisations in the first two questions it seems the female life expectancy indicator figures are generally higher than the male life expectancy figures. This questions investigates if this trend is indeed sustained for every country and every year under review.
The approach to this question is to melt the dataframes so that the years appear as row values under the 'year' column for each year for each country. The male life expectancy figure for each data point is then subtracted from the equivalent female life expectancy figure and where the difference is a positive number (confirming that the female life expectancy is higher), the difference cell is populated with a green bar. Where this is false the bar color will be red
#Melt the dataframes to have the Year columns run as values in the 'year' column
income_df_melt = pd.melt(income_df, id_vars='country', var_name='year', value_name='income_per_capita')
life_expectancy_all_df_melt = pd.melt(life_expectancy_all_df, id_vars='country', var_name='year', value_name='all_life_expectancy')
life_expectancy_male_df_melt = pd.melt(life_expectancy_male_df, id_vars='country', var_name='year', value_name='male_life_expectancy')
life_expectancy_female_df_melt = pd.melt(life_expectancy_female_df, id_vars='country', var_name='year', value_name='female_life_expectancy')
#Merge all four dataframes with the unique combination of country and year as the index
df_merge_1 = income_df_melt.merge(life_expectancy_all_df_melt, how='inner',on=['country', 'year'])
df_merge_2 = df_merge_1.merge(life_expectancy_male_df_melt, how='inner',on=['country', 'year'])
df_merge_3 = df_merge_2.merge(life_expectancy_female_df_melt, how='inner',on=['country', 'year'])
df_combined = df_merge_3.reset_index(drop=True).set_index(['country', 'year'])
#Add a column that calculates the difference between the Female life expectancy and the male life expectancy...
#...for each index (i.e. unique pairing of country and year)
#Where Female life expectancy is greater than Male life expectancy show a green bar, and a red bar where Male life expectancy is higher
#SPOILER ALERT: All bars are Green which means females generally lived longer in the SADC countries for the period under review
df_combined['diff_female_male_life_expectancy'] = (df_combined['female_life_expectancy'] - df_combined['male_life_expectancy'])
df_combined = df_combined.style.bar(subset=['diff_female_male_life_expectancy'], align='mid', color=['#d65f5f', '#5fba7d'])
df_combined
The above visualisation confirms that for the period under review, and for every country and every year, the female life expectancy is higher than the male life expectancy.
For this last question, I take a deep dive into each of the indicators. I do this by plotting on a single visualisation, the values for each of the SADC countries across the years 1971 to 2020 for the specific indicator. I also include the line for the average SADC figure for the indicator so that it is instantly clear how each country is performing against the SADC average for each indicator for each year.
#Plotting the per capita income for each SADC country and the average across the years 1971 to 2020
fig,ax = plt.subplots(figsize=(24,12));
income_df.set_index('country').T.plot(ax=ax);
income_df.mean(axis=0).plot(ax=ax, color='black', label='SADC average', marker = 'o', linestyle="dotted");
ax.legend(loc='upper left',fontsize='x-large');
plt.title('SADC per capita income for each country over the years 1971 - 2020');
ax.set_xlabel('Time period (years): from 1971 to 2020');
#Plotting the per capita income for each SADC country and the average across the years 1971 to 2020
#This version shows only the countries that are tracking below the average line
fig,ax = plt.subplots(figsize=(24,12));
income_df.set_index('country').drop(['Seychelles', 'Mauritius', 'Botswana', 'South Africa', 'Namibia', 'Eswatini']).T.plot(ax=ax);
income_df.mean(axis=0).plot(ax=ax, color='black', label='SADC average', marker = 'o', linestyle="dotted");
ax.legend(loc='upper left',fontsize='x-large');
plt.title('SADC per capita income for each low-income country (below the SADC average only) over the years 1971 - 2020');
ax.set_xlabel('Time period (years): from 1971 to 2020');
#Plotting the combined life expectancy for each SADC country and the average across the years 1971 to 2020
fig,ax = plt.subplots(figsize=(24,12));
life_expectancy_all_df.set_index('country').T.plot(ax=ax);
life_expectancy_all_df.mean(axis=0).plot(ax=ax, color='black', label='SADC average', marker = 'o', linestyle="dotted");
ax.legend(loc='upper left',fontsize='x-large');
plt.title('SADC combined life expectancy for each country over the years 1971 - 2020');
ax.set_xlabel('Time period (years): from 1971 to 2020');
#Plotting the male life expectancy for each SADC country and the average across the years 1971 to 2020
fig,ax = plt.subplots(figsize=(24,12));
life_expectancy_male_df.set_index('country').T.plot(ax=ax);
life_expectancy_male_df.mean(axis=0).plot(ax=ax, color='black', label='SADC average', marker = 'o', linestyle="dotted");
ax.legend(loc='upper left',fontsize='x-large');
plt.title('SADC male life expectancy for each country over the years 1971 - 2020');
ax.set_xlabel('Time period (years): from 1971 to 2020');
#Plotting the female life expectancy for each SADC country and the average across the years 1971 to 2020
fig,ax = plt.subplots(figsize=(24,12));
life_expectancy_female_df.set_index('country').T.plot(ax=ax);
life_expectancy_female_df.mean(axis=0).plot(ax=ax, color='black', label='SADC average', marker = 'o', linestyle="dotted");
plt.title('SADC female life expectancy for each country over the years 1971 - 2020');
ax.legend(loc='upper left',fontsize='x-large');
The report on this project can be found on: https://medium.com/@caviebepura/5-decades-of-promise-for-southern-africa-9bfb7dadedf1
Generally, the data does seem to indicate that the SADC countries have generally been on a rising trend with both the per capita income growth and life expectancy. Another conclusion that the data seems to support is that life expectancy and income seem to rise in tandem.
However, there are a few exceptions to these generalizations. Firstly, Madagascar and the Congo, Democratic Republic seem to have falling per capita incomes for the period under review. Zimbabwe is another country that has had sustained periods of decline on this indicator as well, although there is a period of resurgence post-2008 before stalling again around 2012. It will be interesting to dig deeper into these case studies and see what could have caused this.
The dips in the SADC average on the per capita indicator are curiously coinciding with the global financial crisis of 2008 and the global COVID-19 pandemic of 2019/2020. Again, additional data and analysis into this observation may lead to interesting results.
On the life expectancy trend, I was drawn to sustained decline in the life expectancy during the period 1991 to 2000. UNAIDS statistics indicate that this is the period that HIV & AIDS deaths were most prevalent in Sub-Saharan Africa. I would be interested in pursuing future studies into how this may explain this trend in life expectancy.
In conclusion, I believe this project opens up the options of what can be achieved when various indicators are tracked and compared to each other to see if there are any interesting observations. These observations may then lead to more detailed and scientific studies into how certain phenomenon can be explained with data and trends. This information can play a major role in informing governance, policy setting and focus of funding.
For the data: https://www.gapminder.org/data/
For the funding of this program: https://www.alx-t.com/
Tip: Finally, summarize your findings and the results that have been performed in relation to the question(s) provided at the beginning of the analysis. Summarize the results accurately, and point out where additional research can be done or where additional information could be useful.
Tip: Make sure that you are clear with regards to the limitations of your exploration. You should have at least 1 limitation explained clearly.
Tip: If you haven't done any statistical tests, do not imply any statistical conclusions. And make sure you avoid implying causation from correlation!
Tip: Once you are satisfied with your work here, check over your report to make sure that it is satisfies all the areas of the rubric (found on the project submission page at the end of the lesson). You should also probably remove all of the "Tips" like this one so that the presentation is as polished as possible.
Tip: Before you submit your project, you need to create a .html or .pdf version of this notebook in the workspace here. To do that, run the code cell below. If it worked correctly, you should get a return code of 0, and you should see the generated .html file in the workspace directory (click on the orange Jupyter icon in the upper left).
Tip: Alternatively, you can download this report as .html via the File > Download as submenu, and then manually upload it into the workspace directory by clicking on the orange Jupyter icon in the upper left, then using the Upload button.
Tip: Once you've done this, you can submit your project by clicking on the "Submit Project" button in the lower right here. This will create and submit a zip file with this .ipynb doc and the .html or .pdf version you created. Congratulations!
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])